{epiprocess} and {epipredict}

R packages for signal processing and forecasting


Daniel J. McDonald and Logan C. Brooks

and CMU’s Delphi Group

CSTE Workshop on Infectious Disease Forecasting — 25 June 2023

Background

  • The COVID-19 pandemic required forecasting systems to be implemented quickly

  • Basic processing (outlier detection, reporting issues, geographic granularity) was implemented by many groups in parallel and was error prone

  • Data revisions complicate evaluation

  • Simple models often outperformed complicated ones

  • Custom software not easily adapted / improved by other groups

  • Hard for public health actors to borrow / customize community techniques

{epiprocess}

Basic processing operations and data structures

  • General EDA for “panel data”
  • Calculate rolling statistics
  • Fill / impute gaps
  • Examine correlations
  • Store revision history smartly
  • Inspect revision patterns
  • Find / correct outliers
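
As a rough illustration of "calculate rolling statistics" on panel data, here is a minimal base R sketch of a trailing 7-day average computed separately for each location. It does not use the {epiprocess} API; the data values, the `cases` column, and the helper `trailing_mean()` are made up for the example.

```r
# Toy panel: two locations, ten days each (values are made up)
panel <- data.frame(
  geo_value  = rep(c("ca", "ny"), each = 10),
  time_value = rep(as.Date("2023-01-01") + 0:9, times = 2),
  cases      = c(1:10, seq(20, 2, by = -2))
)

# Trailing mean over the current day and the 6 days before it;
# NA until a full 7-day window is available (hypothetical helper)
trailing_mean <- function(x, width = 7) {
  as.numeric(stats::filter(x, rep(1 / width, width), sides = 1))
}

# Apply the window within each location so values never mix across geos
panel$cases_7dav <- ave(panel$cases, panel$geo_value, FUN = trailing_mean)
```

Grouping by `geo_value` before sliding is the important design choice: a window that ran across the boundary between locations would silently average unrelated series.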

{epiprocess} Data Structures

epi_df: snapshot of a data set

  • a tibble with two required columns, geo_value and time_value
  • arbitrary additional columns containing “measured” values, called “signals”
  • additional “keys” that index subsets (age_group, ethnicity, etc.)

epi_df

Represents a snapshot that contains the most up-to-date values of the signal variables, as of a given time.
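
To make the layout concrete, here is a sketch of the same shape as a plain data frame. It is not an actual epi_df: the `hosp_rate` signal, the `age_group` key, the `as_of` variable, and all values are assumptions made for the example.

```r
# Plain data frame with the layout described above; values are made up
snapshot <- data.frame(
  geo_value  = c("ca", "ca", "ny", "ny"),
  time_value = as.Date(c("2023-01-01", "2023-01-02",
                         "2023-01-01", "2023-01-02")),
  age_group  = "all",                  # an extra key (hypothetical)
  hosp_rate  = c(1.2, 1.4, 0.9, 1.1)   # a signal (hypothetical)
)

# The date on which these were the most up-to-date values
as_of <- as.Date("2023-01-10")

# Each combination of keys should identify exactly one row
key_cols <- c("geo_value", "time_value", "age_group")
one_row_per_key <- !any(duplicated(snapshot[key_cols]))
```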

{epiprocess} Data Structures

epi_archive: collection of epi_dfs

  • full version history of a data set
  • acts like a bunch of epi_dfs — but stored compactly
  • offers functionality similar to epi_df, but using only the data that would have been available at the time
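
The idea of recovering "the data as of a given date" from a compact version history can be sketched in base R as follows. This is an illustration of the concept, not the {epiprocess} implementation; the `version` column, the helper `snapshot_as_of()`, and the values are assumptions.

```r
# Version history: one row per (geo_value, time_value, version); values made up
history <- data.frame(
  geo_value  = "ca",
  time_value = as.Date(c("2023-01-01", "2023-01-01",
                         "2023-01-02", "2023-01-02")),
  version    = as.Date(c("2023-01-02", "2023-01-05",
                         "2023-01-03", "2023-01-06")),
  cases      = c(10, 14, 20, 23)
)

# Snapshot as of a date: for each key, keep the latest version
# published on or before that date (hypothetical helper)
snapshot_as_of <- function(history, as_of) {
  seen <- history[history$version <= as_of, ]
  seen <- seen[order(seen$geo_value, seen$time_value, seen$version), ]
  is_latest <- !duplicated(seen[c("geo_value", "time_value")], fromLast = TRUE)
  out <- seen[is_latest, ]
  rownames(out) <- NULL
  out
}

early <- snapshot_as_of(history, as.Date("2023-01-04"))  # before any revision
late  <- snapshot_as_of(history, as.Date("2023-01-10"))  # after both revisions
```

Here `early$cases` holds the first-reported values and `late$cases` the revised ones, so a model trained on `early` sees only what a forecaster would have seen on 4 January.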

Revisions

Epidemiological data are revised frequently. (This happens in economics as well.)

  • We may want to use the data as it looked in the past
  • or we may want to examine the history of revisions.

Revision patterns

{epipredict}

A framework for building customized forecasters from modular components:

  1. Preprocessor: do things to the data before model training
  2. Trainer: train a model on data, resulting in an object
  3. Predictor: make predictions, using a fitted model object
  4. Postprocessor: do things to the predictions before returning
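
The four stages can be pictured as four small functions chained together. The base R sketch below is only a conceptual illustration and is not how {epipredict} is implemented; every function name and the toy data are assumptions.

```r
h <- 3
dat <- data.frame(time = 1:8, y = 9:2)  # made-up, steadily declining counts

# 1. Preprocessor: pair each value with the value h steps later
preprocess <- function(d, h) {
  n <- nrow(d)
  data.frame(x = d$y[seq_len(n - h)], target = d$y[(h + 1):n])
}

# 2. Trainer: fit a model, returning a fitted object
train <- function(features) lm(target ~ x, data = features)

# 3. Predictor: apply the fitted object to the latest observation
predict_latest <- function(fit, d) {
  unname(predict(fit, newdata = data.frame(x = d$y[nrow(d)])))
}

# 4. Postprocessor: counts cannot be negative, so clamp at zero
postprocess <- function(pred) pmax(pred, 0)

fit        <- train(preprocess(dat, h))
raw_pred   <- predict_latest(fit, dat)  # the linear trend extrapolates below zero
final_pred <- postprocess(raw_pred)
```

Because each stage only depends on the output of the previous one, any single stage (for example the trainer) can be swapped without touching the others.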

A very specialized plug-in to {tidymodels}

Making dumb (but useful!) forecasts in epidemiology

  • We want to predict
    • new hospitalizations \(y\),
    • \(h\) days ahead,
    • at many locations \(j\).
  • We’re going to make a new forecast each week.

Flatline forecaster

For each location, predict \[\hat{y}_{j,\ i+h} = y_{j,\ i}\]
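
A minimal base R sketch of this rule, assuming long-format data with made-up values (the `target_date` and `.pred` column names are chosen for the example):

```r
# Made-up daily hospitalizations for two locations, in long format
obs <- data.frame(
  geo_value  = rep(c("ca", "ny"), each = 3),
  time_value = rep(as.Date("2023-06-01") + 0:2, times = 2),
  y          = c(30, 32, 35, 12, 11, 9)
)
h <- 7  # forecast horizon in days

# For each location, take the last observed value and carry it forward h days
obs <- obs[order(obs$geo_value, obs$time_value), ]
latest <- obs[!duplicated(obs$geo_value, fromLast = TRUE), ]
flatline <- data.frame(
  geo_value   = latest$geo_value,
  target_date = latest$time_value + h,
  .pred       = latest$y
)
```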

AR forecaster

Use an AR model with an extra feature, e.g.: \[\hat{y}_{j,\ i+h} = \mu + a_0 y_{j,\ i} + a_7 y_{j,\ i-7} + b_0 x_{j,\ i} + b_7 x_{j,\ i-7}\]
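
The model above can be fit by ordinary least squares once the lagged features and the shifted target are lined up. The following base R sketch does this for a single simulated location; the data-generating process and all variable names are assumptions made for illustration.

```r
set.seed(1)
n <- 60
h <- 7
# Simulated series for one location; x is a made-up auxiliary signal
x <- 50 + cumsum(rnorm(n))
y <- 10 + 0.2 * x + rnorm(n)

# Row i holds the target y[i + h] and the features y[i], y[i - 7], x[i], x[i - 7]
i <- 8:(n - h)
train <- data.frame(
  target = y[i + h],
  y_lag0 = y[i], y_lag7 = y[i - 7],
  x_lag0 = x[i], x_lag7 = x[i - 7]
)

# Least-squares estimates of mu, a_0, a_7, b_0, b_7
fit <- lm(target ~ y_lag0 + y_lag7 + x_lag0 + x_lag7, data = train)

# Forecast h days past the last observation, using its lag-0 and lag-7 values
newest <- data.frame(
  y_lag0 = y[n], y_lag7 = y[n - 7],
  x_lag0 = x[n], x_lag7 = x[n - 7]
)
y_hat <- unname(predict(fit, newdata = newest))
```

With many locations, one option is to stack the per-location rows into a single training frame so that all locations share the same coefficients.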

{epipredict}

A forecasting framework

  • Flatline forecaster
  • AR-type models
  • Backtest using the versioned data
  • Easily create features
  • Quickly pivot to new tasks
  • Highly customizable for advanced users

{epipredict}

Canned forecasters that work out of the box.

You can do a limited amount of customization.

We currently provide:

  • Baseline flat-line forecaster
  • Autoregressive-type forecaster
  • Autoregressive-type classifier

Basic autoregressive forecaster

  • Predict death_rate 1 week ahead, using 0-, 7-, and 14-day lags of case_rate and death_rate.
  • Use lm for estimation. Also create “intervals”.

library(epipredict)
jhu <- case_death_rate_subset # grab some built-in data
canned <- arx_forecaster(
  epi_data = jhu, 
  outcome = "death_rate", 
  predictors = c("case_rate", "death_rate")
)

The output is a model object that could be reused in the future, along with the predictions for 7 days from now.

Adjust lots of built-in options

rf <- arx_forecaster(
  epi_data = jhu, 
  outcome = "death_rate", 
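  # NOTE: "fb-survey" is illustrative; jhu contains only case_rate and
  # death_rate, so a survey signal would have to be joined in first
  predictors = c("case_rate", "death_rate", "fb-survey"),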
  trainer = parsnip::rand_forest(mode = "regression"), # use ranger
  args_list = arx_args_list(
    ahead = 14, # 2-week horizon
    lags = list(c(0:4, 7, 14), c(0, 7, 14), c(0:7, 14)), # bunch of lags
    levels = c(0.01, 0.025, 1:19/20, 0.975, 0.99), # 23 ForecastHub quantiles
    quantile_by_key = "geo_value" # vary q-forecasts by location
  )
)

Do (almost) anything manually

# A preprocessing "recipe" that turns raw data into features / response
r <- epi_recipe(jhu) %>%
  step_epi_lag(case_rate, lag = c(0, 1, 2, 3, 7, 14)) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 14) %>%
  step_epi_naomit()

# A postprocessing routine describing what to do to the predictions
f <- frosting() %>%
  layer_predict() %>%
  layer_threshold(.pred, lower = 0) %>% # predictions/intervals should be non-negative
  layer_add_target_date(target_date = max(jhu$time_value) + 14) %>%
  layer_add_forecast_date(forecast_date = max(jhu$time_value))

# Bundle up the preprocessor, training engine, and postprocessor
# We use quantile regression
ewf <- epi_workflow(r, quantile_reg(tau = c(.1, .5, .9)), f)

# Fit it to data (we could fit this to ANY data that has the same format)
trained_ewf <- ewf %>% fit(jhu)

# examines the recipe to determine what we need to make the prediction
latest <- get_test_data(r, jhu)

# we could make predictions using the same model on ANY test data
preds <- trained_ewf %>% predict(new_data = latest)

Packages are under active development

Thanks:

  • The whole CMU Delphi Team (across many institutions)
  • Optum/UnitedHealthcare, Change Healthcare.
  • Google, Facebook, Amazon Web Services.
  • Quidel, SafeGraph, Qualtrics.
  • Centers for Disease Control and Prevention.
  • Council of State and Territorial Epidemiologists
  • Natural Sciences and Engineering Research Council of Canada